Finding Young Stellar Populations in Elliptical Galaxies from 
Independent Components of Optical Spectra 



Ata Kaban* Louisa A. Nolaiv Somak Raychaudhury 



Abstract 

Elliptical galaxies are believed to consist of a single pop- 
ulation of old stars formed together at an early epoch 
in the Universe, yet recent analyses of galaxy spectra 
seem to indicate the presence of significant younger pop- 
ulations of stars in them. The detailed physical mod- 
elling of such populations is computationally expensive, 
inhibiting the detailed analysis of the several million 
galaxy spectra becoming available over the next few 
years. Here we present a data mining application aimed 
at decomposing the spectra of elliptical galaxies into 
several coeval stellar populations, without the use of de- 
tailed physical models. This is achieved by performing 
a linear independent basis transformation that essen- 
tially decouples the initial problem of joint processing 
of a set of correlated spectral measurements into that of 
the independent processing of a small set of prototypi- 
cal spectra. Two methods are investigated: (1) A fast 
projection approach is derived by exploiting the corre- 
lation structure of neighboring wavelength bins within 
the spectral data. (2) A factorisation method that takes 
advantage of the positivity of the spectra is also in- 
vestigated. The preliminary results show that typical 
features observed in stellar population spectra of dif- 
ferent evolutionary histories can be convincingly disen- 
tangled by these methods, despite the absence of input 
physics. The success of this basis transformation analy- 
sis in recovering physically interpretable representations 
indicates that this technique is a potentially powerful 
tool for astronomical data mining. 

1 Introduction 

The optical spectrum of a galaxy is a linear superpo- 
sition of the spectra of its billions of constituent stars. 
Yet, since large populations of stars form in galaxies 
at definite periods of its lifetime, and the atomic and 
nuclear physics of the evolution of stellar populations, 
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though complex, are well understood, the detailed mod- 
elling of composite spectra of stellar populations can be 
used to yield a wealth of information about the history 
of a galaxy from its spectrum. 

The spectrum of a star can be modelled as a 
function of three parameters- its mass, its age and 
its composition (since it is made mostly of hydrogen 
and helium, the last parameter is characterised by the 
relative abundance of other elements, and is known as 
the "chemical abundance"). Elliptical galaxies, which 
account for about 20% for all galaxies in the Universe, 
are believed to consist predominantly of a single coeval 
stellar population (e.g. |3J all formed at 

an early epoch in the Universe. This implies that 
an elliptical galaxy can be modelled as a system of 
stars, all of the same age and chemical abundance, 
evolving together, if validated assumptions can be made 
about the distribution of stellar masses. However, as 
a result of detailed spectral studies conducted in the 
last decade (e.g. it now transpires that elliptical 

galaxies are more complex objects, at least some of 
which have undergone more recent bursts of substantial 
star formation, and consequently are likely to contain 
more than one stellar population component. 

The determination of the star formation history 
of a galaxy has important implications for the still- 
controversial issue of the formation and evolution of 
galaxies. Until recently, the analysis of a large statistical 
sample of stellar populations of galaxies would not 
have been possible since only small ensembles of galaxy 
spectra were available. However, the development of 
data mining tools for automating parts of the analysis 
is becoming more and more essential in the light of 
the rapid increase in the availability of data that is 
approaching. 

Recent and ongoing galaxy spectral surveys (2dF- 
GRS, www.mso.anu.edu.au/2dFGRS/ (completed in 
2003) and SDSS, www.sdss.org/) will produce more 
than two million galaxy spectra in the next few years, 
which is to be integrated into a more ambitious 
database of publicly-available astronomical data, in- 
corporating Grid technology (the Virtual Observatory, 
www.ivoa.net). Since the detailed physical modelling 



of stellar populations is numerically expensive, even a 
simple question like "what fraction of elliptical galaxies 
contain a significantly younger stellar population?" will 
take years to address by conventional modelling tech- 
niques using stellar population synthesis. The timely 
extraction of useful knowledge, such as the characteris- 
tics of the star formation history (ages, chemical abun- 
dances and masses of the component stellar popula- 
tions) of galaxies, from these data will largely depend on 
developing appropriate data analysis tools that are able 
to complement more specialised astrophysical analyses. 
The astrophysical questions motivating this study 

are: 

1. Can we disentangle major stellar population com- 
ponents of elliptical galaxies without the use of de- 
tailed physical models? 

2. How do the results from a data-driven analysis of 
observed galaxy spectra correlate with the parame- 
ters of star formation history determined via a com- 
pletely independent model fitting technique used in 
astrophysics? 

These questions have not been addressed before in 
a data driven manner — that is, based on the data 
characteristics only, without any specialised physical 
input. 

1.1 Roadmap In this paper, we discuss and investi- 
gate statistical methods that attempt to solve the de- 
scribed inverse modelling problem by relating multivari- 
ate observations to lower-dimensional vectors of statisti- 
cally independent unobserved variables through the use 
of a linear model. The required statistical assumptions 
will be derived from general characteristics of the data, 
in order to employ these methods in an appropriate 
manner. 

The preliminary results presented in the next sec- 
tions are based on the data described in Section 2. A 
projection approach that exploits the correlation struc- 
ture of the spectra is presented in Section 3. In this 
approach, the required assumption for solving the in- 
verse modelling is derived from exploiting the corre- 
lation structure between neighboring wavelength bins, 
which comes naturally with spectral data. The indepen- 
dent spectral components obtained turn out to be also 
physically interpretable and exhibit typical features of 
spectra of the young and mature stellar populations. We 
then compare the results with a positivity-based single 
stage approach, presented in Section 4, that has been 
often employed for analysing spectral data in different 
domains We provide a simple probabilistic refor- 

mulation of this method that highlights its implicit as- 
sumptions, links it to the methods developed in p|] and 



also allows us to potentially incorporate measurement 
errors (if known from domain knowledge) into the algo- 
rithm. In Section 5, the results are presented, discussed 
and comparatively assessed, first in a data-driven man- 
ner and then, more importantly, from the astrophysical 
perspective. Finally, our conclusions arc summarised in 
the last section. 

2 The data and model setting 

The data we use here represent the observed optical 
spectra of 21 nearby elliptical galaxies, compiled by 
blending together 5855 measurements over the range 
2000-8000 A from various observatories on ground and 
in space. These represent the following galaxies: NGC 
0205, NGC 0224, NGC 1052, NGC 1400, NGC 1407, 
IC 1459, NGC 1553, NGC 3115, NGC 3379, NGC 
3557, NGC 3605, NGC 3904, NGC 3923, NGC 4374, 
NGC 4472, NGC 4621, NGC 4697, NGC 5018, NGC 
5102, NGC 7144 and NGC 7252. The spectra have 
been corrected for redshift (i.e. converted from their 
observed wavelengths to their emitted wavelengths), 
and the fluxes are normalised to unity in the region 
5020-5500 A. 

Since these spectra are compiled from sources with 
varying spectral coverage, the resulting data matrix has 
1453 missing values, which are first imputed using a 
KNN imputation [23 from synthetic data. We preferred 
this non-parametric procedure here, as the missing data 
mechanism may not be random — an assumption made 
by most of other imputation schemes. The validity 
of the 'missing at random' assumption in the case of 
the analysed data set will need further study, simply 
because in some wavelength regions it is consistently 
hard or impossible to take a measurement. 

In addition, for each measurement, an error value is 
also provided from known instrumental characteristics 
and uncertainty in calibration. 

2.1 How many stellar populations? According to 
existing domain knowledge, it is likely that there are 
(at least) two components of interest ^Oj- However, the 
first eigenvector explains more than 95% of the data. 
Therefore, prior to deciding that a 2D representation 
space is justified (i.e. that at the given noise levels there 
is enough useful information in the data and we are not 
attempting to model the noise in a second component), 
we perform some simple, data-driven rank tests. We use 
the error matrix to derive thresholds for these tests. As 
shown on Figure ^ both a 2-norm test and an F-norm 
test suggest that at the given error levels, the 'clean' 
matrix of spectra has rank 2. Although it is known that 
these perturbation bounds often tend to underestimate 
the rank |19| . there are no records of over-estimation, so 
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Figure 1 : Rank-tests for the matrix of stellar population 
spectra. The threshold values represent the 2-norm 
and the F-norm respectively of the matrix of known 
measurement errors. 



we proceed to searching for a suitable two-dimensional 
latent representation space. 

2.2 The model Each stellar population spectrum is 
essentially a vector over a binned wavelength range, 
that represents the flux (in arbitrary units) per unit 
wavelength in the bin centered on the wavelength. The 
data matrix, having T spectra in rows will be denoted 
by Y € lZ TxN . Single elements of this matrix will 
be referred to as yt n , the t-ih row is denoted by y t or 
more explicitly y t and likewise, the n-th column is 
denoted by y n or Vi-xn- The whole matrix Y will also 
be referred to as y^.^- Similar notational convention 
will also apply to other variables in the model. 

To account for the generation mechanism described 
in the previous section, namely that the observed optical 
spectrum of a galaxy is a linear superposition of the 
stars in the galaxy, a linear factor model will be assumed 
in this study. That is, we hypothesise that the T 
observations can be explained as a superposition of 
K < T latent underlying component spectra sj~ € TZ K 
(sometimes termed also as factors or hidden causes 
[T71 IT31 El) that are not observable directly but only 
through an unknown linear mapping A G lZ TxN . 
Formally, this can be written as the following. 



(2.1) 
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In eq. p.ljl . the first term of the r.h.s. is the so called 
systematic component and e is the noise term or the 
stochastic component of this model. The noise term e 
is assumed to be zero-mean i.i.d. Gaussian, where the 



diagonal structure of the covariance accounts for the 
notion that all dependencies that exist in Y should be 
explained by the underlying hidden components. 

The K components will be assumed statistically in- 
dependent, this being a standard assumption of inde- 
pendent factor models El ■ I Q the present applica- 
tion, this assumption is also cosmologically plausible, 
as there is little (no) interaction between stellar popu- 
lations at different ages in a galaxy. The linearity of the 
mixture is physically justified as the fluxes of the hy- 
pothesised different subpopulations mix in an additive 
way. 

3 Independent projections of stellar population 
spectra 

A projection based approach is presented in this sec- 
tion. This will be accomplished in stages. A linear 
dimensionality reduction will first be performed. We 
then proceed at identifying independent directions in 
the low-dimensional projection space. This multi-stage 
projection approach is well-suited as a first attempt. It 
allows us to formulate sub-tasks in statistical terms, and 
given the 2D nature of the problem, it also allows us to 
benefit from a visual control over the data representa- 
tion obtained at various stages. 

3.1 Dimensionality reduction using SVD Di- 
mensionality reduction is a useful preprocessing stage 
for both computational convenience and de-noising. It is 
well known from linear algebra j^^j that the best rank- 
K approximation of a matrix under any unitarily invari- 
ant norm is its rank-K SVD (Singular Value Decompo- 
sition) approximation. This is given by Y w UDV J , 
where U is the T x K matrix of left singular vectors, 
D is the K x K diagonal matrix of singular values, 
V is the N x K matrix of right singular vectors, and 
U T U = V T V = Ik, where Ik is the if-dimensional 
identity matrix. The projection is then simply obtained 
as X = U T Y. 

It should be pointed out that the scope of an SVD- 
based projection is to identify an optimal (in the sense 
of any unitarily invariant norm) subspace of the data 
space. However, generally, individual singular vectors 
or eigenvectors are not interpretable separately, as they 
are not independent from each other. The same is true 
for PC A [TJI, for much the same reasons. 

3.2 Finding non-orthogonal informative direc- 
tions using contextual ICA We now turn to the key 
part of our analysis, where the directions of indepen- 
dent projection need to be found. Approaches with this 
aim are known under the name of Independent Compo- 
nent Analysis (ICA) [TIROE!- A vast number of ICA 



algorithms have been developed over the last decade, 
each having different built-in assumptions. In general 
terms, we can write the data likelihood of the desired 
basis transformation as the following 



contextual (predictive) model. 



(3.2) 



p(x 1:N \B) = J ds 1:N p(x 1:N \s 1 .. N , B)p(s 1:N ). 



where B is the K xK unknown linear mapping (squared 
mixing matrix) that transforms the latent components 
S into X. That is, as standard in ICA, instead of 
inferring S from Y , it is easier to infer them from X. 

Assuming that the SVD projection performed as 
described in the previous section has removed the noise, 
then the noise term is a delta function 



(3.3) p(x 1:N \s 1:N , B) = 5(xi :N - Bsi 
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where B is a squared K x K parameter matrix (mixing 
matrix) that contains the desired new bases in its 
columns and Si : at are the independent representations 
in the new basis — both having to be estimated from 
the data. Thus, l|3.2|l reduces to the simple form below 
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Standard in squared ICA problems, it is easier 
to optimise for the inverse of B. That is, instead 
of the 'top-down', or 'generative' transform B, we 
estimate the 'bottom-up' or 'projection' transform B 1 . 
However, without knowing p(si : jv), this is still an ill- 
posed problem. Clearly, a mechanical application of 
any ICA algorithm, out of the hundreds of existing ones, 
would produce different results, although any of these 
would be somewhat arbitrary. What we need is a well 
motivated prior distribution p(si-n). However, as in 
most data mining applications of ICA, there is no such 
information explicitly available. 

3.2.1 Exploiting correlations within the spec- 
tral data Let us observe, however, that in spectral 
data, there is a natural correlation structure between 
flux values in neighbouring wavelength bins. This is 
what we exploit here, by capturing it in a form of a 
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n=l 

Vfc = 1 : K. The advantage of doing so is that now we 
only need to specify the form of density of the residual 
projections. Assuming a good enough predictor, then 
the residual is likely to have a heavy tailed (termed also 
super-Gaussian [7| or kurtotic) form of density. Indeed, 
using just the simplest first order predictor, which is an 
identity function 

(3.6) E[sk n \s k ,l:„-x] = Sfe,n-l,Vfc = I l K 

the difference process x n — x n -\ of the data already 
becomes highly kurtotic, as shown on Figure El 

A similar approach has been previously taken and 
successfully demonstrated in the context of face image 
separation 5 , where neighbouring pixel values of an 
image do also exhibit significant correlations. 




Figure 2: Histograms of the difference process. Kurtosis 
values are 11.0503 and 33.9364 respectively. 



Let us denote r n = x n — E[x n \xi :n —i] and u kn — 
Skn — E[skn\sk,i-.n-i}- The data likelihood is then the 
following. 

(3.7) p(x 1:N ) = IdetB- 1 ^ JJ JJp((B-Vn) 

n k 

where now we know that p(uk n ) is a super-Gaussian 
density. Maximisation of this likelihood can now be 



accomplished by employing any standard ICA algorithm 
- over n-.N rather than xi-.n- As the predictor may 
not be very accurate (we just used an identity predictor 
in our experiments), it is preferable to chose a robust 
approximation of the generalised exponential density, 
that grows relatively slowly in |wfc„|. Following the 
arguments in |7|, in our experiments we have used the 
following: 

(3.8) logp(u fe ) oc exp(-u|/2), 

and the optimisation has been performed using the New- 
ton method implemented in the FastICA routines 
employing the faster deflationary approach. This has a 
cubic convergence [J]. Indeed, highly kurtotic indepen- 
dent projections have been found (kurtosis: 33.8796 and 
12.7484 respectively) on the data investigated, in about 
ten iterations only. 

A geometric illustration of the procedure just de- 
scribed is shown on Fig. [3] The SVD-compressed data 
are shown as dots and indeed, informative directions 
would be difficult to determine directly from the data. 
The scatter-plot of the difference process is shown as 
crosses. A star-like structure is apparent, with two 
main, non-orthogonal linear directions of high data den- 
sity. These are the new bases (columns of B) that are 
determined by the ICA procedure. Indeed, the two di- 
rections defined by the new bases found by the algo- 
rithm are highlighted on the plot as dark lines. The 
PCA axes of the data are also shown on the same plot 
for comparison. Interestingly, one of the axes is almost 
identical to one of the independent directions. The sec- 
ond is, however, just orthogonal to the first, while the 
ICA axes are not orthogonal to each other but do follow 
the two main directions of high density in the data. 

To obtain the component spectra from the ICA pro- 
cedure described, we simply compute the projections of 
the individual flux values of all galaxies at all wave- 
length bins onto the new bases (which is the composi- 
tion of the two linear transforms performed during the 
analysis process described): 

(3.9) su N = Bx 1:N = Ay 1:N 

where B and A now denote the recovered mixing pa- 
rameters. Specifically, after B is found, A is computed 
as the matrix product UB. 

The component-wise reconstruction of the 21 indi- 
vidual stellar population spectra from their independent 
components are shown on Figure The physical inter- 
pretability of these components will be assessed in Sec- 
tion 5, however, as a data-driven observation, it is inter- 
esting to note that the recovered spectral components 
turned out to be positive valued — although positivity 



has not been artificially imposed at this stage during the 
analysis process. We note that indeed negative values 
of the flux would be difficult to interpret, therefore in 
the next section we discuss a different approach, where 
the required latent density p(si-jq) is derived from a 
positivity constraint. 

4 A Positivity-based approach 

The use of positive factorisation of positive matrices to 
replace PCA for analysing positive data, such as spec- 
tral data dates back to work reported in [Hj. Positive 
(more exactly non-negative) factor models have been 
further developed in |18j . However, in the absence of ei- 
ther a density-based or a geometric interpretation, the 
implicit assumptions are not clear and therefore the in- 
terpretation of the results may not be straightforward. 
Somewhat related, in |13j. a fully Bayesian formulation 
of a positive factorisation model is given and variants 
with sparse positive priors are applied to synthetic stel- 
lar population spectra in a different context of investi- 
gations than ours. Retaining the probabilistic frame- 
work, that allows us to make all assumptions explicit, 
we present a simpler version of their algorithm, based 
on maximum a posteriori (MAP) / maximum likelihood 
(ML) estimation, which will highlight the link with pos- 
itive factorisation algorithms ^Hj- A MAP or ML es- 
timation is sufficient for our purposes as we are con- 
cerned with a data explanation task for a fixed data 
set only. Once the most appropriate model is found, 
the full Bayesian machinery remains available to derive 
fully generative models that are able to better generalise 
on new data. 

Retaining the positivity of the representation, and 
the fact that the overall transform in the approach 
adopted in the last section has been linear, the following 
linear model can be formulated. 

(4.10) p(y n \A,E n ) = J ds P (y n \As,E n )p{s) 

A Gaussian measurement noise will be assumed. Fur- 
thermore, according to prior knowledge about the lev- 
els of imprecision of the physical instruments, which 
vary independently for each stellar population and each 
wavelength bin, we will have individual diagonal vari- 
ances E„ at each wavelength bin n, p(y n \s, A, E n ) ~ 
N(y n \As n , E„). In rest, we use the same notation as 
before, y n refers to the relative flux values observed at 
the n-th wavelength bin, A is the unknown mixing ma- 
trix parameter and p(s) is the distribution of the latent 
components. 

Now the latent prior needs to be specified. Apart 
from its positive support we don't have much informa- 
tion in this respect. Therefore we formulate a vague 
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Figure 3: Geometric illustration of the described ICA procedure. The number of points shown as dots equals the 
number of different wavelength bins in the set of measurements, each point being the 2D compressed representation 
of the fluxes at one of these bins. The differences between these 2D vectors at consecutive wavelength bins are 
marked with 'x'. The PC A basis of the data is superimposed (in light color) and also translated to the origin for 
comparison with the ICA basis found on the difference process. The ICA basis is shown in dark color. 



exponential prior, that is 

(4.11) p(s) oc JJcxp(-a fc „|s fc |) 

k 

where akn = a, Vfc,n is a small positive constant. If 
a — > 0, then the prior becomes non-informative but 
also improper and in this case the MAP estimation 
procedure given below becomes ML. 

To obtain a MAP estimate, the posterior needs to 
be maximised p(s n \y n , A, E n ) oc p(y n \s n ,A, E n )p(s n ) 
which is proportional to the complete data likelihood. 

Positivity of the elements of both A and S are 
then imposed by adding Lagrangian terms [2]] to the 
complete data log likelihood. 

£ = ^2{logp(y n \s n ,A,E n ) + logp(s n )} 

71 

+ TrLjA + TrL^S 

where L\ and li% are a set of non-negative Lagrange 
multipliers and Tr denotes the trace of a matrix. 

From the stationary equations w.r.t. A and S the 
Lagrange multipliers are obtained. 

£ 1 = S E n 2As nsl - E n 2 V n sl 

n n 

L 2 = J2 A T E- 2 As n - ]T A T E- 2 y n + a 

n n 

Now from the Karush-Kuhn- Tucker conditions |2()| 
LtkAtk = and L^nSkn = 0, Vt, k, n, we have two fixed 
point equations which provide the convergent alternat- 
ing iterative algorithm of the multiplicative form below. 



A = A (E~ 2 Y)S T [E~ 2 AS]S T 

S = SQA T {E- 2 QY)(Z){A T [E- 2 QAS]+a} 

where denotes element-wise multiplication and 
denotes element- wise division. If a = 0, the iterative 
algorithm above is identical to the least-squares based 
non-negative factorisation algorithm proposed in |18j 
from a non-probabilistic starting point, with the only 
difference that now we also have the E terms to account 
for known measurement errors. 

5 Evaluation 

5.1 Data driven evaluation Here we assess the 
effectiveness of the methods presented above according 
to two indicators: (1) data reconstruction and (2) the 
mutual information (MI) between the components of the 
representation created. 
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Table 1: Data reconstruction errors under the L2 norm. 
cICA = contextual ICA, NMF = Non-negative matrix 
factorisation with a=0, NMFe = NMF with exponential 
prior having a = 0.1 

Table 1 shows the data reconstruction results across 
all N x T measurements for the various methods. These 
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Figure 4: The reconstruction of stellar population spectra using the projection based contextual ICA. Here, our 
algorithm has decomposed the observed spectrum of each galaxy into two inferred populations (blue and dark 
green — the two darkest colors, in black-and-white printing), whose sum is given in red, superimposed with the 
data in light green (intermediate grey and light grey respectively, in black-and-white printing). In many cases 
(e.g. #1,9,10,18), a significant younger stellar population is found, which is confirmed by our detailed physical 
modelling. The numbering of the spectra corresponds to the enumeration order of the corresponding galaxies as 
given in Section 2. 



are measured as the squared distances between the data 
and the reconstruction, which is in accordance with the 
Gaussian noise assumption. The reconstruction error 
of cICA are identical to that of the SVD, since the 
ICA transform does not reduce the dimensionality fur- 
ther. For NMF and NMFe, 15 randomly initialised runs 
were performed and the one with the highest likelihood 
value was selected in this evaluation, in order to avoid 
the possibility of getting trapped in a local optimum. 
Indeed, note that the positivity-based single stage ap- 
proach involves a non-convex optimisation whereas the 
SVD-based preprocessing is a convex problem. The me- 
dian of the error appears to be the smallest in the case 
of SVD+cICA, however, a pairwise application of the 
non-parametric Wilcoxon rank sum test to the whole 
sample distribution of the individual T x N reconstruc- 
tion errors returned that the difference of the medians 
for cICA and NMF is not statistically significant at the 
0.05% level. In the case of all other pairs, the differences 
between medians were found statistically significant — 
as expected, the additional term that enforces the la- 
tent prior distribution in NMFe causes a slight increase 
in data reconstruction error. However, this small dif- 
ference on its own would not be a practically concern- 
ing issue here, as we deliberately formulated constraints 
in our models in order to obtain representations that 
are statistically independent to a higher degree — in 
the hope that these may then be interpretable and ca- 
pable of being independently further processed. More 
interesting is therefore to evaluate to what extent do 
these methods achieve statistically independent repre- 
sentations. 

The information theoretic quantity that measures 
the degree of statistical dependence is the Mutual In- 
formation (MI) JI]. This is a non- negative value that 
vanishes if the component densities are perfectly inde- 
pendent (the smaller the MI the better). It can be 
shown ^3] that maximising the ICA log likelihood in 
the noise-free case is equivalent to minimising MI be- 
tween the representation components. Here we compute 
the sample-based MI of the components, as estimated 
according to the procedure described in |2J. The com- 
parative values obtained by various methods for this 
data are shown in Table 2. For the contextual ICA 
method, that works on predictive residuals, two values 
are given: the first value is computed from the projec- 
tions of the residuals whereas the second value is com- 
puted from the projections of the data, taken as it was 
iid. (just for obtaining a value to compare with those 
obtained with the rest of the methods). It is apparent 
from the table that in both cases the contextual ICA 
model achieves lower MI (greater independence of the 
components). The positivity-based method with non- 



Method 


MI 


NMF 


0.5923 


NMFe 


0.6019 


cICA 


(u) 0.00010394 




(s) 0.5583 


PCA 


1.0931 



Table 2: Sample-based mutual information estimates of 
the two components obtained with the various methods 
for the set of stellar population spectra investigated. 
Smaller values signify higher independence achieved 
after the estimated basis transform. 



informative improper prior is the next best performing 
method, whereas employing sparse priors (greater a val- 
ues) does not lead to components that are more inde- 
pendent, for this data. PCA is used as a baseline, as 
we know that it doesn't produce an independent rep- 
resentation. Visually, the positivity based components 
do not look much different from the ones obtained from 
cICA (Fig. 0J, although they are slightly more noisy. 
The PCA-based decomposition, however, in general, are 
theoretically not guaranteed to be interpretable, as al- 
ready discussed in Section 3.1 (see for details). We 
now turn to evaluate the interpretability of the obtained 
results from the astrophysical perspective. 

5.2 Astrophysical evaluation Here, we compare 
the results from the linear independent basis transfor- 
mation analysis with an entirely independent determi- 
nation of the star formation history, based on detailed 
astrophysical models of the evolution of stellar popula- 
tions. 

The 21 observed spectra have been analysed by 
matching them with synthetic stellar population spec- 
tra. For each of the observed spectra, a two-stellar 
population component model |S] was fitted. The age, 
chemical abundance and relative mass fraction of each 
component were allowed to vary freely. The best fit in 
each case was determined by a minimum \ 2 test |15llllj . 

The principle for creating synthetic stellar popu- 
lation spectra is simple, although the input physics is 
complex. The spectral energy distribution of a star 
evolves according to its initial mass and chemical abun- 
dance. If the initial mass distribution and the chemical 
abundance of a stellar population is known, and the 
spectral evolution of each individual star in this initial 
population may be modelled, the stellar spectra may be 
summed over the mass distribution at any point in time 
to give the integrated spectrum of the population at that 
age. The ingredients for a stellar population model are 
therefore: stellar evolutionary tracks; a library of stellar 
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Figure 5: Comparison of the derived components with physical models of the stellar population spectra. Top two 
plots: Synthetic stellar population spectra according to the physical models of Right: Spectra of a population 
of age 10 Gyr, where chemical abundance, from bottom to top, 0.2, 1.0 and 2.5 times solar; Left: age = 0.7 Gyr, 
same chemical abundances. The dotted lines mark some of the absorption features in the spectrum which are 
typically strong in young stellar populations, and the dashed lines mark some of the absorption features which 
are typically strong in old stellar populations. From left to right, the absorption line species are: Mgll (2799 A), 
He (3970 A), FF5 (4102 A), H 7 (4340 A), H/3 (4861 A), Mgb (5175 A), NaD (5893 A), Ha (6563 A), TiO (7126 
A). Bottom two plots: the 2 components found from the various different linear independent basis transformation 
analyses, from bottom to top: cICA, NMF, NMFe, PCA. (The spectra are shifted along the vertical axis for 
the sake of clarity.) The recovered spectra are convincingly disentangled into one component with young stellar 
population features (Mgll, He, H<5, H7, H/3, Ha: dotted lines) and shape, and a second with the features (Mgb, 
NaD, TiO: dashed lines) and shape of an old, high chemical abundance stellar population. 



spectra; a method of calibrating the theoretical lumi- 
nosity and effective temperature, determined from the 
evolutionary sequence, so that the appropriate atmo- 
sphere may be assigned to each star at each time-step 
in its evolution. In contrast, the linear independent ba- 
sis transformation method employs no knowledge of the 
underlying physics in the observed spectra. 

Model spectra [5] for a young (0.7 Gyr) and an old 
(10 Gyr) stellar population, with three different chemi- 
cal abundances (0.2, 1.0 and 2.5 times solar abundance) 
are shown in Fig. together with the spectra recov- 



ered from the linear independent basis transformation 
analyses. The similarity is most apparent. The data- 
driven linear analyses, whilst not reproducing any of the 
physical model spectra precisely (which would not be 
expected anyway) , extract many of the important iden- 
tifying characteristics of these two categories of model 
spectra, which are indeed quite different from each other 
both in their overall shape and details. Some of the 
most important features, the so-called absorption- lines, 
are marked with dashed (Mgb, NaD and TiO, typically 
strong features in old, high chemical abundance stel- 



lar population spectra) and dotted (Mgll, He, H<5, H7, 
H/3 and Ha, typically strong in the spectra of young 
1 Gyr) stellar populations) vertical lines on the figure 
(Fig.[SJ), which demonstrate the physical interpretability 
of the representations created by the basis transforma- 
tion analyses. 

We correlate the star formation history parameters 
derived from fitting the two-component model spectra 
to the observed spectra (i.e. a physical analysis ap- 
proach) with the weight of the contributions from the 
linear basis transformation analyses, by defining these 
weights as Ck — a tk/J2k' atk ' f° r an y gi ven spectrum t. 
Here, atk is the (t, fc)-th element of the matrix A of the 
new basis and k = 1 : 2. Fig. shows the results of the 
correlations, and Fig. [7| graphically shows some of these 
correlations. 

From Figs. and we can conclude that, for 
the ICA and NMF analyses, cl (and hence also c2, 
as ci + C2 = 1) correlates with the proportion of 
young (5 1 Gyr) stellar population component present 
in the observed spectrum, regardless of their chemical 
abundance. 

6 Conclusions 

We have presented a scientific data mining application 
that searches for linear independent basis transforma- 
tions of galaxy spectra to find the spectra of individual 
stellar populations characterised by age and chemical 
abundance. We have shown that characteristic stel- 
lar population components of elliptical galaxies can be 
disentangled from the observed spectra of these galax- 
ies, without the use of detailed physical models. The 
components returned by the linear basis transforma- 
tion analyses are clearly physically interpretable, with 
one component displaying the shape and many of the 
absorption-line features typical of a young stellar pop- 
ulation, and the second component having the over-all 
shape and typical absorption features of an old, high 
chemical abundance stellar population. The weights of 
the contributions from the linear basis transformation 
analyses correlate well with both the ages of the younger 
stellar populations and the mass fractions of the smaller 
stellar populations determined from the (completely in- 
dependent) detailed physical modelling of the observed 
galaxy spectra. 

The computational demand of the projection ap- 
proach presented is essentially that of the SVD compu- 
tation, so it is expected that the method is easily ap- 
plicable to large sets of measurements as they become 
available. The positivity based approach, per iteration, 
has a comparable scaling, however, in all our experi- 
ments the number of iterations to convergence was of an 
order of magnitude larger for the positivity based single 



stage approach. Further study is necessary to investi- 
gate models with other types of positively supported 
priors as well as refining the best performing models 
and algorithms to be able to deal with previously un- 
seen data. 

The use of the data analysis presented in this 
paper, integrated with the more complex process of 
astrophysical analysis will be detailed elsewhere Hi . 
We intend to investigate the effectiveness of these data- 
driven methods on larger sets of UV-optical spectra 
as they become available, where more comprehensive 
statistical evaluation will be possible. From the analysis 
of large archives of galaxy spectra using this technique, 
we hope to address some of the fundamental questions 
in astrophysics, that of when and how galaxies form and 
evolve. 
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Figure 7: Scatter-plots showing the correlation of (top) the age of the younger stellar population and (bottom) 
the mass fraction of the smaller stellar population determined from the model spectra fitting with the weight of 
the first component of the various linear basis transformation analyses (cl). A high value (low for PCA) of cl 
clearly corresponds to a substantial young (S 1 Gyr) stellar population. 



